
    Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

    We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder--decoder pipeline, popular in image and video captioning systems. We propose to utilize two different kinds of video features: one to capture the video content in terms of objects and attributes, and the other to capture the motion and action information. Using these diverse features we train models specializing in two separate input sub-domains. We then train an evaluator model which is used to pick the best caption from the pool of candidates generated by these domain-expert models. We argue that this approach is better suited to the current video captioning task than a single model, due to the diversity in the dataset. The efficacy of our method is demonstrated by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation. Additionally, we were ranked second in the table based on automatic evaluation metrics.
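
    The candidate-pool selection described above lends itself to a compact sketch. The following is a minimal, illustrative PyTorch implementation; the scorer architecture, feature dimensions and function names are assumptions for illustration, not taken from the paper.

    import torch
    import torch.nn as nn

    class CaptionEvaluator(nn.Module):
        """Scores (video feature, caption embedding) pairs with a small MLP."""
        def __init__(self, video_dim=2048, caption_dim=300, hidden=512):
            super().__init__()
            self.scorer = nn.Sequential(
                nn.Linear(video_dim + caption_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, video_feat, caption_emb):
            # video_feat: (K, video_dim), caption_emb: (K, caption_dim)
            pair = torch.cat([video_feat, caption_emb], dim=-1)
            return self.scorer(pair).squeeze(-1)             # (K,) scores

    def pick_best_caption(evaluator, video_feat, candidates):
        # candidates: list of (caption_text, caption_embedding) from the expert models
        embs = torch.stack([emb for _, emb in candidates])   # (K, caption_dim)
        feats = video_feat.expand(len(candidates), -1)       # repeat the video feature K times
        return candidates[evaluator(feats, embs).argmax().item()][0]

    # Toy usage with random tensors standing in for real features.
    evaluator = CaptionEvaluator()
    pool = [("a man plays a guitar", torch.randn(300)),
            ("people are dancing", torch.randn(300))]
    print(pick_best_caption(evaluator, torch.randn(2048), pool))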

    Natural Language Description of Images and Videos

    Understanding visual media, i.e. images and videos, has been a cornerstone topic in computer vision research for a long time. Recently, a new task within the purview of this research area, that of automatically captioning images and videos, has garnered widespread interest. The task involves generating a short natural language description of an image or a video. This thesis studies the automatic visual captioning problem in its entirety. A baseline visual captioning pipeline is examined, including its two constituent blocks, namely visual feature extraction and language modeling. We then discuss the challenges involved and the methods available to evaluate a visual captioning system. Building on this baseline model, several enhancements are proposed to improve the performance of both the visual feature extraction and the language modeling. The deep convolutional neural network based image features used in the baseline model are augmented with explicit object and scene detection features. In the case of videos, a combination of action recognition and static frame-level features is used. The long short-term memory network based language model used in the baseline is extended by introducing an additional input channel and residual connections. Finally, an efficient ensembling technique based on a caption evaluator network is presented. Results from extensive experiments conducted to evaluate each of the above-mentioned enhancements are reported. The image and video captioning architectures proposed in this thesis achieve state-of-the-art performance on the corresponding tasks. To support these claims, results from two video captioning challenges organized over the last year are reported, both of which were won by the models presented in the thesis. We also quantitatively analyze the automatic captions generated and identify several shortcomings of the current system. After having identified these deficiencies, we briefly look at a few interesting problems which could take automatic visual captioning research forward.
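
    As a concrete reference for the baseline pipeline mentioned above, the sketch below shows a pre-extracted CNN image feature conditioning an LSTM language model that predicts the caption word by word. The dimensions and the choice to inject the image feature as the first input step are illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn as nn

    class BaselineCaptioner(nn.Module):
        def __init__(self, vocab_size=10000, feat_dim=2048, embed_dim=512, hidden=512):
            super().__init__()
            self.visual_proj = nn.Linear(feat_dim, embed_dim)  # map the CNN feature into word-embedding space
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, image_feat, captions):
            # image_feat: (B, feat_dim), captions: (B, T) word indices
            vis = self.visual_proj(image_feat).unsqueeze(1)    # (B, 1, embed_dim)
            words = self.embed(captions)                       # (B, T, embed_dim)
            inputs = torch.cat([vis, words], dim=1)            # image feature acts as the first token
            hidden_states, _ = self.lstm(inputs)
            return self.out(hidden_states)                     # (B, T+1, vocab_size) next-word logits

    # Toy forward pass with random data.
    model = BaselineCaptioner()
    logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
    print(logits.shape)  # torch.Size([4, 13, 10000])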

    Not Using the Car to See the Sidewalk: Quantifying and Controlling the Effects of Context in Classification and Segmentation

    The importance of visual context in scene understanding tasks is well recognized in the computer vision community. However, it is unclear to what extent computer vision models for image classification and semantic segmentation depend on context to make their predictions. A model overly reliant on context will fail when it encounters objects in context distributions different from the training data, and hence it is important to identify these dependencies before deploying such models in the real world. We propose a method to quantify the sensitivity of black-box vision models to visual context by editing images to remove selected objects and measuring the response of the target models. We apply this methodology to two tasks, image classification and semantic segmentation, and discover undesirable dependencies between objects and context, for example that "sidewalk" segmentation relies heavily on "cars" being present in the image. We propose an object-removal-based data augmentation solution to mitigate this dependency and increase the robustness of classification and segmentation models to contextual variations. Our experiments show that the proposed data augmentation helps these models improve performance in out-of-context scenarios, while preserving performance on regular data.
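
    A rough sketch of the measurement idea, under simplifying assumptions: here the selected object is "removed" by filling its mask with the per-channel mean colour rather than by a learned editing model, and context sensitivity is read off as the change in the target-class score. All names are illustrative.

    import torch
    import torch.nn as nn

    def context_sensitivity(model, image, object_mask, target_class):
        """image: (3, H, W) float tensor; object_mask: (H, W) bool marking the object to remove."""
        edited = image.clone()
        fill = image.mean(dim=(1, 2))                  # per-channel mean colour
        edited[:, object_mask] = fill.view(3, 1)       # crude removal by flat fill
        with torch.no_grad():
            before = torch.softmax(model(image.unsqueeze(0)), dim=-1)[0, target_class]
            after = torch.softmax(model(edited.unsqueeze(0)), dim=-1)[0, target_class]
        return (before - after).item()                 # large drop => strong dependence on the removed object

    # Toy usage with a random linear "classifier" and a square mask.
    toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    img = torch.rand(3, 32, 32)
    mask = torch.zeros(32, 32, dtype=torch.bool)
    mask[8:20, 8:20] = True
    print(context_sensitivity(toy_model, img, mask, target_class=3))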

    Adversarial content manipulation for analyzing and improving model robustness

    The recent rapid progress in machine learning systems has opened up many real-world applications -- from recommendation engines on web platforms to safety-critical systems like autonomous vehicles. A model deployed in the real world will often encounter inputs far from its training distribution. For example, a self-driving car might come across a black stop sign in the wild. To ensure safe operation, it is vital to quantify the robustness of machine learning models to such out-of-distribution data before releasing them into the real world. However, the standard paradigm of benchmarking machine learning models with fixed-size test sets drawn from the same distribution as the training data is insufficient to identify these corner cases efficiently. In principle, if we could generate all valid variations of an input and measure the model response, we could quantify and guarantee model robustness locally. Yet doing this with real-world data is not scalable. In this thesis, we propose an alternative: using generative models to create synthetic data variations at scale and testing the robustness of target models to these variations. We explore methods to generate semantic data variations in a controlled fashion across visual and text modalities. We build generative models capable of performing controlled manipulation of data, such as changing the visual context, editing the appearance of an object in an image, or changing the writing style of text. Leveraging these generative models, we propose tools to study the robustness of computer vision systems to input variations and systematically identify failure modes. In the text domain, we deploy these generative models to improve the diversity of image captioning systems and to perform writing-style manipulation that obfuscates private attributes of the user. Our studies quantifying model robustness explore two kinds of input manipulations: model-agnostic and model-targeted. Model-agnostic manipulations leverage human knowledge to choose the kinds of changes without considering the target model being tested. This includes automatically editing images to remove objects not directly relevant to the task and to create variations in visual context. Alternatively, in the model-targeted approach the input variations are directly adversarially guided by the target model. For example, we adversarially manipulate the appearance of an object in the image to fool an object detector, guided by the gradients of the detector. Using these methods, we measure and improve the robustness of various computer vision systems -- specifically image classification, segmentation, object detection and visual question answering systems -- to semantic input variations.
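
    To make the model-targeted setting concrete, the sketch below applies a single FGSM-style gradient step restricted to the pixels inside an object mask, lowering the score of the true class. This is a generic, simplified stand-in for the gradient-guided appearance manipulation described above, not the method developed in the thesis; the step size and the toy classifier are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def masked_adversarial_step(model, image, object_mask, true_class, step_size=0.03):
        """image: (3, H, W); object_mask: (H, W) bool; returns the edited image."""
        x = image.clone().unsqueeze(0).requires_grad_(True)
        loss = F.cross_entropy(model(x), torch.tensor([true_class]))
        loss.backward()
        # Ascend the loss, but only inside the object region.
        perturb = step_size * x.grad.sign() * object_mask.float()
        return (x + perturb).clamp(0, 1).squeeze(0).detach()

    # Toy usage with a random classifier.
    toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    img = torch.rand(3, 32, 32)
    mask = torch.zeros(32, 32, dtype=torch.bool)
    mask[4:16, 4:16] = True
    adv = masked_adversarial_step(toy_model, img, mask, true_class=0)
    print((adv - img).abs().max())  # changes are confined to the masked region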

    Adversarial Scene Editing: Automatic Object Removal from Weak Supervision

    While great progress has been made recently in automatic image manipulation, it has been limited to object-centric images like faces or structured scene datasets. In this work, we take a step towards general scene-level image editing by developing an automatic, interaction-free object removal model. Our model learns to find and remove objects from general scene images using image-level labels and unpaired data in a generative adversarial network (GAN) framework. We achieve this with two key contributions: a two-stage editor architecture consisting of a mask generator and an image in-painter that cooperate to remove objects, and a novel GAN-based prior for the mask generator that allows us to flexibly incorporate knowledge about object shapes. We experimentally show on two datasets that our method effectively removes a wide variety of objects using weak supervision only.
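
    A much-simplified sketch of the two-stage editor: a mask generator predicts where the object is, an in-painter fills the masked region, and the two are composited into the edited image. The tiny convolutional networks here are illustrative stand-ins for the paper's architectures, and the GAN training losses are omitted.

    import torch
    import torch.nn as nn

    class MaskGenerator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # soft object mask in [0, 1]
            )
        def forward(self, x):
            return self.net(x)

    class InPainter(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),     # masked image + mask as input
                nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
            )
        def forward(self, x, mask):
            return self.net(torch.cat([x * (1 - mask), mask], dim=1))

    def remove_object(image, mask_gen, inpainter):
        mask = mask_gen(image)                                 # (B, 1, H, W)
        filled = inpainter(image, mask)                        # hallucinated background
        return mask * filled + (1 - mask) * image              # composite edited image

    # Toy forward pass.
    edited = remove_object(torch.rand(1, 3, 64, 64), MaskGenerator(), InPainter())
    print(edited.shape)  # torch.Size([1, 3, 64, 64])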

    Dunbar syndrome: a rare presentation of abdominal angina treated by revascularization of the celiac artery by endovascular stenting

    Median arcuate ligament syndrome (MALS) is a rare entity characterized by extrinsic compression of the celiac artery and symptoms of postprandial epigastric pain, nausea, vomiting, and weight loss mimicking mesenteric ischemia. The following case illustrates a rare cause of abdominal pain, in which a young woman was found to have celiac trunk stenosis secondary to compression of the trunk by the median arcuate ligament. She underwent successful stenting of the ostial celiac trunk, which relieved her symptoms. Decompression of the celiac artery is the general approach. Usually, after percutaneous transluminal angioplasty (PTA), once revascularisation is achieved, 75% of patients remain asymptomatic at follow-up.

    A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation

    Text-based analysis methods can reveal privacy-relevant author attributes such as the gender, age and identity of a text's author. Such methods can compromise the privacy of an anonymous author even when the author tries to remove privacy-sensitive content. In this paper, we propose an automatic method, called Adversarial Author Attribute Anonymity Neural Translation (A4NT), to combat such text-based adversaries. We combine sequence-to-sequence language models used in machine translation with generative adversarial networks to obfuscate author attributes. Unlike machine translation techniques, which need paired data, our method can be trained on unpaired corpora of text containing different authors. Importantly, we propose and evaluate techniques to impose constraints on A4NT to preserve the semantics of the input text. A4NT learns to make minimal changes to the input text to successfully fool author attribute classifiers, while aiming to maintain the meaning of the input. We show through experiments on two different datasets and three settings that our proposed method is effective in fooling the author attribute classifiers and thereby improving the anonymity of authors.
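
    The training signal described above can be summarised by the sketch below: an adversarial term pushes an attribute classifier towards the wrong attribute for the rewritten text, while a reconstruction term stands in for the semantic-preservation constraints. This collapses the GAN component into a simple classifier-fooling loss for illustration; the function name, weights and tensor shapes are assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def obfuscation_loss(attr_logits, fake_attr, recon_logits, input_ids,
                         adv_weight=1.0, sem_weight=1.0):
        """
        attr_logits:  (B, num_attrs) attribute classifier output on the rewritten text
        fake_attr:    (B,) attribute labels we want the classifier to predict (not the true ones)
        recon_logits: (B, T, V) logits for reconstructing the original text back
        input_ids:    (B, T) original word indices
        """
        adv_loss = F.cross_entropy(attr_logits, fake_attr)                            # fool the attribute classifier
        sem_loss = F.cross_entropy(recon_logits.flatten(0, 1), input_ids.flatten())   # keep the content recoverable
        return adv_weight * adv_loss + sem_weight * sem_loss

    # Toy tensors standing in for real model outputs.
    B, T, V, A = 2, 5, 100, 2
    loss = obfuscation_loss(torch.randn(B, A), torch.tensor([1, 0]),
                            torch.randn(B, T, V), torch.randint(0, V, (B, T)))
    print(loss.item())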

    Integrating artificial intelligence for knowledge management systems – synergy among people and technology: a systematic review of the evidence

    This paper analyses Artificial Intelligence (AI) and Knowledge Management (KM), focusing primarily on examining to what degree AI can help companies in their efforts to handle information and manage knowledge effectively. A search was carried out across relevant electronic bibliographic databases and the reference lists of relevant review articles. Articles were screened, and eligibility was assessed using the participants, procedures, comparisons, outcomes (PICO) model and the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) criteria. The results reveal that knowledge management and AI are interrelated fields, as both are intensely connected to knowledge; the difference lies in how each engages with it: while AI offers machines the ability to learn, KM offers a platform to better understand knowledge. The research findings further point out that communication, trust, information systems, incentives or rewards, and the structure of an organization are related to knowledge sharing in organizations. This systematic literature review is the first to throw light on KM practices and the knowledge cycle, and on how the integration of AI aids knowledge management systems, enterprise performance and the distribution of knowledge within the organization. The outcomes offer a better understanding of efficient and effective knowledge resource management for organizational advantage. Future research on smart assistant systems is necessary, as such systems can provide social benefits that strengthen competitive advantage. This study indicates that organizations must take note of specific KM leadership traits and organizational arrangements to achieve stable performance through KM.

    Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

    While strong progress has been made in image captioning in recent years, machine and human captions are still quite distinct. A closer look reveals that this is due to deficiencies in the generated word distribution and vocabulary size, and a strong bias in the generators towards frequent captions. Furthermore, humans -- rightfully so -- generate multiple, diverse captions, due to the inherent ambiguity of the captioning task, which is not considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human-generated captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our method achieves performance comparable to the state of the art in terms of the correctness of the captions, we generate a set of diverse captions that are significantly less biased and match the word statistics better in several aspects.
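
    A minimal sketch of the core trick mentioned above: the Gumbel-softmax relaxation makes word sampling approximately differentiable, so the discriminator's signal can flow back into the caption generator. The discriminator here is a trivial stand-in, and only the sampling step reflects the paper's idea; all dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, embed_dim, T = 1000, 64, 12
    word_logits = torch.randn(4, T, vocab_size, requires_grad=True)   # generator outputs (B, T, V)

    # Differentiable "sampling": soft one-hot vectors instead of hard word indices.
    soft_words = F.gumbel_softmax(word_logits, tau=0.5, hard=False)   # (B, T, V)

    # The discriminator embeds the soft words by multiplying with its embedding table.
    embedding = nn.Embedding(vocab_size, embed_dim)
    word_vectors = soft_words @ embedding.weight                      # (B, T, embed_dim)
    discriminator = nn.Sequential(nn.Flatten(), nn.Linear(T * embed_dim, 1))
    realness = discriminator(word_vectors)                            # (B, 1) "human-like" score

    # Gradients reach the generator's logits through the soft samples.
    realness.sum().backward()
    print(word_logits.grad.shape)  # torch.Size([4, 12, 1000])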